Time-series: theory (1/2)

1 Packages

library(tidyverse)  # THE library for data science
library(openxlsx)   # A cool package to deal with Excel files/formats
library(quantmod)   # Package for financial data extraction
library(tsibble)    # TS with dataframes framework 
library(fable)      # Package for time-series models & predictions
library(feasts)     # Package for time-series analysis
library(forecast)   # Another package for TS (forecasting)
library(tseries)    # Simple package for time-series tests
library(plotly)     # For 3D graphs
library(mvtnorm)    # For multivariate Gaussian distributions
set.seed(42)        # Random seed

Time-series models are built to capture and mimic the randomness that we observe in real-life situations.
Hence, understanding randomness is key.

2 Random variables

Random variables are mathematical tools (formalizations) that characterize, quite naturally, realizations of events that cannot be exactly predicted, for which there is some form of uncertainty.

To introduce key concepts, we will proceed with basic case studies.

2.1 Example 1: the coin toss

2.1.1 Fair

Perhaps the simplest (and easiest to understand) example is the tossing of coins.
Let’s first suppose that the coin is not biased, so that there is a 50% chance of obtaining heads or tails.
We can simulate this with a computer. Below, let’s say that

  • 0 = heads
  • 1 = tails

Let’s repeat the process 10 times:

coins <- rbinom(10, 1, 0.5)
coins
 [1] 1 1 0 1 1 1 1 0 1 1

Now, the most important concept with random variables has to do with the frequency of their realizations.
This is often referred to as the probability that an event occurs.
In the case of a coin toss, there are only two possible outcomes. Hence, the representation of probabilities is easy (two numbers).

tibble(coins = coins) |> ggplot(aes(x = coins)) + geom_bar() 

Now, the particular random draw is somewhat surprising. We would expect something closer to 50%/50%.
But that’s the game of random draws!

In order to get a more balanced result with high likelihood, there is only one way: to consider many more than 10 draws.

coins <- rbinom(100, 1, 0.5)
coins
  [1] 0 1 1 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1
 [38] 1 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1
 [75] 1 1 0 0 0 0 1 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 0

Et voilà!

tibble(coins = coins) |> ggplot(aes(x = coins)) + geom_bar() 

As the number of draws increases, the frequency will slowly converge to 50%/50%.
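
To see this convergence in action, here is a minimal sketch (the seed and the number of tosses are arbitrary choices) that tracks the running proportion of tails as the number of tosses grows, an illustration of the law of large numbers:

```r
# Sketch: the running frequency of tails converges to the true probability
# (law of large numbers).
set.seed(123)                             # arbitrary seed for reproducibility
n      <- 10^4
tosses <- rbinom(n, 1, 0.5)               # n fair coin tosses
freq   <- cumsum(tosses) / seq_len(n)     # proportion of tails after each toss
freq[c(10, 100, 1000, n)]                 # drifts towards 0.5 as n grows
```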

2.1.2 Biased

It’s also possible to imagine a biased coin.

game <- rbinom(100, 1, 0.7)
tibble(game = game) |> ggplot(aes(x = game)) + geom_bar() 

2.2 Example 2: rolling a die

Ok, so coin tossing is easy and overly simplistic. Let’s consider a more complex setting in which more than two outcomes are possible.

Rolling dice seems like a perfect choice!

rolls <- replicate(150, sample(1:6, 1, replace = TRUE))
tibble(rolls = rolls) |> 
  ggplot(aes(x = as.factor(rolls))) + geom_bar() +
  theme(axis.title.x = element_blank())

In the above graph, the bars represent the number of times that each outcome is produced.
If we want frequencies/probabilities instead of counts, we need to divide by the number of draws.

tibble(rolls = rolls) |> 
  ggplot(aes(x = rolls)) + 
  geom_histogram(breaks = 0:6, aes(y = after_stat(density)), color = "white") +
  theme(axis.title.x = element_blank()) +
  scale_x_continuous(breaks = 0:6)

2.3 Example 3: distribution of heights

Ok but let’s push even further.
Imagine we were to ask every person in one country their height (say, in centimeters). If we pick individuals randomly, each answer makes one draw of a random variable. We can visualize this! We point to three sources:

  • Our World in Data, though the data is quite old (1996), it remains fine for long-term perspectives;
  • World Data - much more recent & up-to-date (2022);
  • The CDC in the US reports slightly different values.
sizes <- rnorm(150, mean = 178, sd = 7) |> round()
tibble(sizes = sizes) |> ggplot(aes(x = sizes)) + geom_histogram()

There are holes if the sample is small…

But if we ask 100,000 people, we get a much smoother distribution, which, in this case, is the Gaussian distribution, with its bell-shaped density. Sometimes, it is also referred to as the Normal distribution.

sizes <- rnorm(10^5, mean = 178, sd = 7) |> round()
tibble(sizes = sizes) |> ggplot(aes(x = sizes)) + geom_histogram(breaks = 155:205)

2.4 The Gaussian distribution

Let’s spend a bit of time on the properties of this very (very) important distribution.
It is entirely defined by two parameters:

  • its mean (average)
  • its variance (uncertainty)
ggplot() + xlim(-3,3.5) + theme_classic() + 
  geom_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(color = "baseline"), linewidth = 1.2) +
  geom_function(fun = dnorm, args = list(mean = 0, sd = 0.5), aes(color = "less uncertainty")) + 
  geom_function(fun = dnorm, args = list(mean = 1, sd = 1), aes(color = "shifted")) +
  theme(legend.position = c(0.8,0.8),
        axis.title = element_blank(),
        legend.title = element_blank()) +
  scale_color_manual(values = c("#000", "#1166DD", "#FF5511")) +
  geom_vline(xintercept = 0, linetype = 2, color = "#1166DD") +
  geom_vline(xintercept = 1, linetype = 2, color = "#FF5511") +
  geom_segment(aes(x = -0.45, y = 0.5, xend = 0.45, yend = 0.5), color = "#1166DD",
               arrow = arrow(length = unit(0.3, "cm"), ends = "both")) + 
  geom_segment(aes(x = -0.95, y = 0.25, xend = 0.95, yend = 0.25), color = "black",
               arrow = arrow(length = unit(0.3, "cm"), ends = "both")) +
  geom_segment(aes(x = 0, y = 0.8, xend = 1, yend = 0.8), color = "#666666",
               arrow = arrow(length = unit(0.3, "cm"))) +
  annotate("text", x = 0.5, y = 0.82, label = "shift") + 
  annotate("rect", fill = "white", xmin = -0.37, xmax = 0.37, ymin = 0.52, ymax = 0.60) + 
  annotate("text", x = 0, y = 0.55, label = "less\nvariance") + 
  geom_segment(aes(x = 0, y = 0, yend = 0.4, xend = 0), linetype = 1)

In 95%+ of models this is the distribution that is used to generate randomness.

Ok but how can I know if my data is distributed according to a normal law?

The simplest way is to produce a QQ-plot (quantile-versus-quantile) to see if the empirical distribution matches (is close to) the Gaussian one. Below, it seems “ok” - with some outliers nonetheless. Another approach is simply to display the histogram and see if it is “bell-shaped”.

data.frame(x = rnorm(30)) |>
  ggplot(aes(sample = x)) + stat_qq() + stat_qq_line()

But with an exponential distribution…

set.seed(42)
data.frame(x = rexp(50)) |> ggplot(aes(x = x)) + geom_density()

… not so much! Note that the distribution is standardized beforehand.

set.seed(42)
data.frame(x = rexp(30)) |> ggplot(aes(sample = x)) + stat_qq() + stat_qq_line()

2.5 Multivariate distributions

In fact, random variables may come in pairs, or triplets, or more! For instance, we could imagine studying weight in addition to height!
In this case, representations are of course more complex and an important concept is that of covariance:

\[\text{Cov}(X_1, X_2) = \mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)],\]

where \(\mu_j=\mathbb{E}[X_j]\). The correlation is the covariance normalized by the standard deviations, \(\text{Cor}(X_1, X_2) = \text{Cov}(X_1, X_2)/(\sigma_1\sigma_2)\), which confines it to \([-1,1]\).
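
As a quick sanity check of these definitions, the sketch below (simulated, illustrative data) recomputes the correlation by hand and compares it with R’s built-in cor():

```r
# Sketch: correlation = covariance divided by the product of standard deviations.
# (R's cor() and sd() use the n-1 convention, so we match it here.)
set.seed(1)
x1 <- rnorm(500)
x2 <- 0.6 * x1 + rnorm(500)              # two correlated variables (illustrative)
n  <- length(x1)
by_hand <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / ((n - 1) * sd(x1) * sd(x2))
c(by_hand = by_hand, built_in = cor(x1, x2))  # the two values coincide
```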

Let us show a (stylized) example below. To see how correlation affects the density, it’s imperative to play with the 3D representation. We increase the correlation on purpose to show its effect.

mu <- c(178, 79)  # mean vector
sigma <- 9*matrix(c(1, 0.7, 0.7, 1), nrow = 2)  # covariance matrix

# Create grid of x and y values
x <- mu[1] + seq(-7, 7, length.out = 100)
y <- mu[2] + seq(-7, 7, length.out = 100)
grid <- expand.grid(x = x, y = y)
z <- dmvnorm(as.matrix(grid), mean = mu, sigma = sigma)
z_matrix <- matrix(z, nrow = length(x), ncol = length(y))

plot_ly(x = ~x, y = ~y, z = ~z_matrix) %>%
  add_surface(colorscale = "Viridis") %>%
  layout(
    scene = list(
      xaxis = list(title = "Height (cm)"),
      yaxis = list(title = "Weight (kgs)"),
      zaxis = list(title = "Density"),
      camera = list(eye = list(x = 1.5, y = -1.5, z = 1.25))
    )
  )

To visualize correlation, the simplest way is to use a scatterplot with one variable on one axis and the other on the other!
The correlation is linked to the slope of the regression line: a strongly positive (resp. negative) slope implies positive (resp. negative) correlation. When the line is roughly flat, the variables do not co-move: correlation is weak or null.

Z <- rmvnorm(200, mean = mu, sigma = sigma)
data.frame(x = Z[,1], y = Z[,2]) |>
  ggplot(aes(x = x, y = y)) + geom_point() + labs(x = "Height (cm)", y = "Weight (kgs)") +
  geom_smooth(method = "lm")
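
The link between slope and correlation can be checked directly; the following sketch (with simulated, illustrative data) verifies that the OLS slope equals \(\text{Cov}(x,y)/\text{Var}(x)\), i.e., the correlation times \(sd(y)/sd(x)\):

```r
# Sketch: the OLS slope equals Cov(x, y) / Var(x), i.e., the correlation
# times sd(y) / sd(x), which ties the regression line to the correlation.
set.seed(3)
x <- rnorm(300)
y <- 2 * x + rnorm(300)                  # illustrative linear relationship
slope     <- unname(coef(lm(y ~ x))["x"])
cov_ratio <- cov(x, y) / var(x)
cor_form  <- cor(x, y) * sd(y) / sd(x)
c(slope = slope, cov_ratio = cov_ratio, cor_form = cor_form)  # all three coincide
```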

2.6 Bottomline

Distribution functions

What matters in random variables is the probability of their realizations. The tool that always works is the distribution function.

  • If \(X\) is discrete, then \(P[X \le k]\) characterizes the variable (and so does \(P[X=k]\)).
  • If \(X\) is continuous, only \(P[X \le x]\) does. (\(P[X = x]=0\), but \(f(x)= \frac{d}{dx}P[X \le x]\) can be defined and serves as the density, which is the continuous counterpart of the histogram.)
  • If \(X=(X_1, \dots, X_n)\) is multivariate, then look at \(P[X_1 \le x_1, \dots, X_n \le x_n]\).
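
For the continuous case, a minimal numerical sketch: the density of the standard Gaussian matches the finite-difference slope of its distribution function (the evaluation point \(x=0.7\) is an arbitrary choice):

```r
# Sketch: for a continuous variable, the density is the derivative of the
# distribution function. Finite-difference check for the standard Gaussian.
x <- 0.7
h <- 1e-6
num_deriv <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)  # slope of the CDF at x
c(num_deriv = num_deriv, density = dnorm(x))          # nearly identical
```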

3 Random processes

Ok, so until now, we have considered random variables one draw at a time.

There are, however, several ways to combine them and consider higher dimensions:
  • in the cross-section: imagine you are interested in height and weight at the same time. This makes two numbers. And, of course, they should be related: all other things equal, taller people should be heavier.
  • in the time domain. Imagine you record a given quantity periodically: economic output (GDP) on a quarterly basis; sales of Adidas shoes annually, the value of Bitcoin daily, etc. You also get several values! And these can also follow interesting patterns that you might want to understand… and predict!

Because this is a course on time-series, we will henceforth focus on the latter.

3.1 Definitions & notations

Time-series are random variables indexed by time. Usually, they are written \(x_t\), or \(X_t\) or \(y_t\) or similar notations of this kind. The index \(t\) refers to time. The essential question in time-series is to determine the law of motion of the process, as well as its characteristics.

Most often, the process integrates some form of randomness, which is usually written \(e_t\), \(u_t\), \(\epsilon_t\) or \(\varepsilon_t\). There are two common simplifying assumptions that are made with respect to these quantities, which are sometimes referred to as innovations or residuals:

  1. they are normally distributed (with a constant Gaussian law);
  2. they are independent from each other (i.e., \(e_t\) is independent from \(e_{t+1}\)).

This is called the IID (independent and identically distributed) assumption. Under a Gaussian law, this is equivalent to the innovations being uncorrelated.

NOTE: in almost all models, innovations have zero means, \(\mathbb{E}[e_t]=0\), hence in the case of Gaussian innovations, only the variance \(\sigma^2\) is specified. We will work under these (standard) assumptions below. Let’s generate a series of \(e_t\) and plot it below.

N <- 120
sigma <- 0.2
x <- rnorm(N, mean = 0, sd = sigma)
data.frame(t = 1:N, x = x) |>
  ggplot(aes(x = t, y = x)) + geom_line() + theme_minimal()

mean(x)
[1] -0.03540623
sd(x)
[1] 0.2175779

We see that the sample averages are not too far from the parameters we chose (zero mean and 0.2 SD).

3.2 Random walk

The simplest case is the white noise: \[x_t= e_t,\]

but this is not very interesting.

Definition

A second important case is linked to the notion of random walk: \[x_t=x_{t-1}+e_t, \tag{1}\] and, iterating, we see that \(x_t=\sum_{s=1}^t e_s\), where we assume a starting point \(x_0=0\).

Notably, we have \[\mathbb{E}[x_t]=\mathbb{E}\left[\sum_{s=1}^t e_s\right]=\sum_{s=1}^t\mathbb{E}\left[ e_s\right]=0.\]

Moreover, if the innovations are independent,

\[\mathbb{V}[x_t]=\mathbb{V}\left[\sum_{s=1}^t e_s\right]=\sum_{s=1}^t\mathbb{V}\left[ e_s\right]=t\sigma^2.\]

Hence, the variance increases with time - linearly. If perturbations are not independent, then it’s a bit more complicated and the variance will depend on all auto-covariances.

The covariance between two points in time is driven by the number of overlapping innovation terms:

\[\mathbb{C}ov[x_t,x_{t+h}] = \mathbb{E}\left[\sum_{s=1}^t\sum_{r=1}^{t+h} e_se_r\right]=t\sigma^2, \quad h \ge 0.\] This is because all innovation terms are independent, hence the only non-null terms are the \(t\) terms of the form \(\mathbb{E}[e_s^2]=\sigma^2\).

Hence, we see that both the variance and auto-covariance depend on \(t\), say, the present time. This makes analysis complicated, because, depending on the moment we consider, the distribution of \(x_t\) will change.

Nevertheless, note that \(\mathbb{C}ov[x_t,x_{t-h}]=(t-h)\sigma^2\). So the general formula is \[\mathbb{C}ov[x_t,x_{s}]=\sigma^2\min(s,t).\]
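
These formulas are easy to verify by simulation; the sketch below (the parameters are illustrative) estimates \(\mathbb{V}[x_t]\) across many independent trajectories:

```r
# Sketch: check V[x_t] = t * sigma^2 by simulating many independent
# random-walk trajectories and computing the cross-sectional variance.
set.seed(7)
sigma  <- 0.2
n_traj <- 5000                           # many independent trajectories
n_time <- 100
paths  <- matrix(rnorm(n_time * n_traj, sd = sigma), nrow = n_time) |>
  apply(2, cumsum)                       # one random walk per column
c(empirical = var(paths[50, ]),          # variance across paths at t = 50...
  theory    = 50 * sigma^2)              # ...close to t * sigma^2
```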

Let’s simulate one trajectory below.

#set.seed(42) # This freezes the random number generator
x <- cumsum(rnorm(N, mean = 0, sd = sigma))
tibble(t = 1:N, x = x) |>
  ggplot(aes(x = t, y = x)) + geom_line() + theme_minimal()

Ok but this is just ONE trajectory. There are an infinite number of possible paths!

#set.seed(42) # This freezes the random number generator
nb_traj <- 10
x <- matrix(rnorm(N*nb_traj, mean = 0, sd = sigma), ncol = nb_traj) |> apply(2, cumsum)
bind_cols(t = 1:N, x) |>
  pivot_longer(-t, names_to = "simulation", values_to = "value") |>
  ggplot(aes(x = t, y = value, color = simulation)) + geom_line() + theme_minimal()

3.3 Unit root tests

Ok, but how can we determine whether the process we are observing is a random walk?
Recall from Equation 1 that \(x_t = x_{t-1}+e_t\). This is the special case \(\rho=1\) of \[x_t = \rho x_{t-1}+e_t. \tag{2}\] Hence we can estimate Equation 2 and test whether \(\rho=1\) or \(\rho \neq 1\). Alternatively, it is more usual to test whether \(\delta=0\) in \[\Delta x_t=x_t-x_{t-1}=\delta x_{t-1}+ e_t,\] where \(\delta = \rho - 1\).

This is referred to as the Dickey-Fuller test. If \(\delta=0\) (or \(\rho=1\)), then we say that there is a unit root.
The test can be generalized to more complex processes, for instance correcting for constant and trends (and seasonality \(s_t\) potentially, too!): \[\Delta x_t=a_0+a_1t +s_t+\delta x_{t-1}+e_t.\]
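
The regression behind the test can be sketched with a plain OLS fit (simulated data; note that proper inference requires the non-standard Dickey-Fuller critical values, which adf.test supplies, so the usual t-tables do not apply here):

```r
# Sketch of the regression at the heart of the Dickey-Fuller test:
# regress the differenced series on the lagged level and inspect delta-hat.
set.seed(42)
x_rw <- cumsum(rnorm(200))               # a simulated random walk
dx   <- diff(x_rw)                       # Delta x_t
lx   <- x_rw[-length(x_rw)]              # x_{t-1}
fit  <- lm(dx ~ lx)
coef(fit)["lx"]                          # delta-hat: close to zero under a unit root
```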

NOTE

\(\Delta x_t\) is called the difference process.

The augmented Dickey-Fuller test also includes more lags of the difference process in the regression:

\[\Delta x_t=a_0+a_1t+\delta_0x_{t-1}+\sum_{j=1}^J\delta_j \Delta x_{t-j}+e_t.\]

One example below.

adf.test(x[,1])

    Augmented Dickey-Fuller Test

data:  x[, 1]
Dickey-Fuller = -1.0213, Lag order = 4, p-value = 0.9314
alternative hypothesis: stationary

How can we interpret this?

Limitations of the DF test

One of the main issues with the test is that the null (what we test, i.e., \(\rho=1\)) is synonymous with “not stationary” (see just below for more on that…).
This means that when H0 is rejected, the series is “not not stationary”.
But this does not imply that it is stationary.

In fact, a likely better test (one that incorporates the variance) is the KPSS test.
In this test, the null is “the series is stationary”. The test is more technical than the DF test; it looks at the sum of squared partial sums of residuals and focuses on the variance (which is perhaps why it works better on our examples below).

kpss.test(x[,1])

    KPSS Test for Level Stationarity

data:  x[, 1]
KPSS Level = 1.2202, Truncation lag parameter = 4, p-value = 0.01

So not a lot of support for H0.
\(\rightarrow\) in this case, both tests agree.

3.4 Ok so what?

Simply put, if the test concludes that there is a unit root, what does that mean?

  • Well, it means that there is not much we can do (the model is simple)…
  • Except perhaps focusing on the properties of the residuals/innovations.

If there is no unit root, then we need to dig deeper and consider alternative (more sophisticated) models.

3.5 A crucial concept: stationarity

There are several definitions of stationarity. The common denominator is the requirement of stability through time. The whole point of time-series is forecasting. Now, to be able to forecast, the analyst needs some guarantees that the model she learns or builds will have value in the future. For this to hold, the time-series must show some form of invariance. Indeed, if empirical properties change all the time, then it is impossible to infer durable patterns. A simple example is stock prices. Compare the values of the Apple or Nvidia share price in 2000 and in 2025.

tickers <- c("AAPL", "NVDA", "WMT", "XOM")  
# Tickers: Apple, Nvidia, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS"
min_date <- "2000-01-01"                      # Starting date
max_date <- "2025-11-08"                      # Ending date
prices <- getSymbols(tickers, src = 'yahoo',  # The data comes from Yahoo Finance
                     from = min_date,         # Start date
                     to = max_date,           # End date
                     auto.assign = TRUE, 
                     warnings = FALSE) %>% 
  map(~Ad(get(.))) %>%                        # Retrieving the data
  purrr::reduce(merge) %>%                    # Merge in one dataframe
  `colnames<-`(tickers)                       # Set the column names
prices |> head(4)
                AAPL       NVDA      WMT      XOM
2000-01-03 0.8392806 0.08942016 14.23962 17.40645
2000-01-04 0.7685208 0.08703261 13.70681 17.07304
2000-01-05 0.7797667 0.08416735 13.42707 18.00380
2000-01-06 0.7122871 0.07867520 13.57360 18.93455

Not exactly the same…

prices |> tail(4)
               AAPL   NVDA    WMT    XOM
2025-11-04 269.7785 198.69 102.27 114.14
2025-11-05 269.8784 195.21 101.47 113.68
2025-11-06 269.5087 188.08 101.68 114.50
2025-11-07 268.2100 188.15 102.59 117.22

Visually, this is easy to see: we cannot study asset prices in 2000-2010 and assume that the model will be relevant in 2025…

prices |> 
  data.frame() |>
  rownames_to_column(var = "date") |>
  ggplot(aes(x = as.Date(date), y = AAPL)) + geom_line() +
  geom_segment(x = as.Date("2000-01-01"), xend = as.Date("2010-01-01"), y = 20, yend = 20, 
               arrow = arrow(ends = "both", length = unit(0.25, "cm")), color = "blue") +
  annotate("text", x = as.Date("2005-01-01"), y = 30, label = "mean < 100", color = "blue", size = 5) +
  geom_segment(x = as.Date("2021-01-01"), xend = as.Date("2026-01-01"), y = 100, yend = 100, 
               arrow = arrow(ends = "both", length = unit(0.25, "cm")), color = "blue") +
  annotate("text", x = as.Date("2023-06-01"), y = 110, label = "mean > 100", color = "blue", size = 5) +  
  theme_minimal() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 16)) 

In finance, raw data is usually not stationary.
But what’s a trick we have seen that can transform a non-stationary series into a stationary one? Inspired by unit root tests…
\(\rightarrow\) Differencing!

The problem is that simple differencing won’t work well here because of a scale issue. If \(p_t \approx 1\) or if \(p_t \approx 1,000\), the variations won’t have the same magnitudes. We show this below.

prices |> 
  data.frame() |>
  rownames_to_column(var = "date") |>
  ggplot(aes(x = as.Date(date), y = AAPL - lag(AAPL))) + geom_line() +
  geom_segment(x = as.Date("2005-01-01"), xend = as.Date("2023-06-01"), y = 2, yend = 20, 
               arrow = arrow(ends = "both", length = unit(0.25, "cm")), color = "blue") +
  annotate("text", x = as.Date("2012-06-01"), y = 13, label = "increasing magnitude", color = "blue", size = 5, angle = 17) +
  theme_minimal() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 16)) 

Hence, we use relative variations (they are called returns in finance):

\[r_t = \frac{p_t-p_{t-1}}{p_{t-1}}.\]

And magnitudes remain roughly comparable throughout the sample.
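
One way to see why returns fix the scale issue: they are invariant to the overall price level, whereas simple differences scale with it. A minimal sketch with an illustrative price path:

```r
# Sketch: returns are scale-invariant, simple differences are not.
p  <- c(100, 102, 101, 105)              # illustrative price path
p2 <- 10 * p                             # same path at 10x the price level
diff(p2) / diff(p)                       # differences scale with the level (all 10)
r1 <- p[-1] / p[-length(p)] - 1          # returns of the original path
r2 <- p2[-1] / p2[-length(p2)] - 1       # returns of the rescaled path
all.equal(r1, r2)                        # identical returns
```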

prices |> 
  data.frame() |>
  rownames_to_column(var = "date") |>
  ggplot(aes(x = as.Date(date), y = AAPL/lag(AAPL)-1)) + geom_line() +
  theme_minimal() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 16)) 

Now let’s turn to definitions of stationarity. The most strict formulation is the following.

Strict stationarity

A stochastic process \(\{X_t\}_{t \in T}\) is said to be strictly stationary if for any finite collection of time points \(t_1, t_2, \ldots, t_n \in T\) and any time shift \(h\), the joint distribution of \((X_{t_1}, X_{t_2}, \ldots, X_{t_n})\) is identical to the joint distribution of \((X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h})\).

Mathematically, this means, in terms of distribution functions: \[F_{X_{t_1}, X_{t_2}, \ldots, X_{t_n}}(x_1, x_2, \ldots, x_n) = F_{X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h}}(x_1, x_2, \ldots, x_n)\] for all \(n \in \mathbb{N}\), all \(t_1, \ldots, t_n, h \in T\), and all \(x_1, \ldots, x_n \in \mathbb{R}\).

This definition is a bit tough and not very practical. It’s hard to test and not necessarily useful.
Focusing on the (much simpler) first two (co-)moments is easier and workable.

Weak stationarity

A stochastic process \(\{X_t\}_{t \in T}\) is said to be weakly stationary (or covariance stationary, or second-order stationary) if it satisfies the following three conditions:

  • The mean function is constant: \(\mathbb{E}[X_t] = \mu \quad \text{for all } t \in T\);
  • The variance is finite: \(\text{Var}(X_t) = \mathbb{E}[(X_t - \mu)^2] < \infty \quad \text{for all } t \in T\);
  • The autocovariance function depends only on the time lag: \(\text{Cov}(X_t, X_{t+h}) = \gamma(h) \quad \text{for all } t, h \in T\),

where \(\gamma(h) = \mathbb{E}[(X_t - \mu)(X_{t+h} - \mu)]\) is the autocovariance function at lag \(h\).
In particular, the variance is constant through time, too.

Assuming series are detrended (and de-seasonalized), the Dickey-Fuller test provides a way to determine if series are stationary or not.

4 Application to real data

4.1 Airline traffic

We recycle the code from the previous session.

url <- "https://github.com/shokru/coqueret.github.io/raw/refs/heads/master/files/misc/time_series/traffic-sheet.xlsx"
airline <- read.xlsx(url, sheet = 3, startRow = 2)
air_data <- airline |> select(1:3)                  # Keep only the first 3 columns
colnames(air_data) <- c("date", "CDG", "ORY")       # Rename these columns
air_data <- air_data |>                             # Reformat the date
  mutate(date = as.Date(as.numeric(date), origin = "1899-12-30")) |>
  filter(is.finite(date), is.finite(CDG)) |>
  arrange(date) |>
  mutate(date = yearmonth(date)) |>
  distinct(date, .keep_all = T) |>
  as_tsibble() |>
  fill_gaps()
air_data |> head(4)
# A tsibble: 4 x 3 [1M]
      date     CDG     ORY
     <mth>   <dbl>   <dbl>
1 2000 Jan 3223328 1935261
2 2000 Feb 3289676 1942750
3 2000 Mar 3891206 2204640
4 2000 Apr 4221430 2266448

But we first need to remove the seasonality component.

decomp <- stl(air_data[,-3], s.window="periodic")
CDG_sa <- seasadj(decomp)
bind_cols(air_data[,-3], CDG_sa = CDG_sa) |>
  pivot_longer(-date, names_to = "series", values_to = "passengers") |>
  ggplot(aes(x = date, y = passengers, color = series)) + geom_line() + 
  theme_classic() + 
  theme(legend.position = c(0.2,0.2))

Ok so now we can move on and test for stationarity.
NOTE: by default, the test adds a constant and a linear trend.

adf.test(x = CDG_sa)

    Augmented Dickey-Fuller Test

data:  CDG_sa
Dickey-Fuller = -3.2265, Lag order = 6, p-value = 0.0838
alternative hypothesis: stationary

The result is not super clear cut…

  • H0: not stationary
  • H1: stationary

Small p-value: “evidence” against the null… (The probability, under the null H0, to observe data as extreme as the one we have, is weak.)

With KPSS:

kpss.test(x = CDG_sa)

    KPSS Test for Level Stationarity

data:  CDG_sa
KPSS Level = 0.47494, Truncation lag parameter = 5, p-value = 0.04731

So here the test rejects the null!

4.2 Carbon concentration

Same but with CO2 data.

url <- "https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_mlo.csv"
co2 <- read.csv(url, skip = 40)
co2 <- co2 |> 
  mutate(date = make_date(year = year, month = month, day = 15)) |>
  select(date, average, deseasonalized) |>
  mutate(date_iso = date,
         date = yearmonth(date)) |>
  as_tsibble(index = date)
adf.test(x = co2$deseasonalized)

    Augmented Dickey-Fuller Test

data:  co2$deseasonalized
Dickey-Fuller = 0.1593, Lag order = 9, p-value = 0.99
alternative hypothesis: stationary

Potential problem here: the “drift” may not be linear, but quadratic…

kpss.test(x = co2$deseasonalized)

    KPSS Test for Level Stationarity

data:  co2$deseasonalized
KPSS Level = 11.499, Truncation lag parameter = 6, p-value = 0.01

The test again rejects the null of stationarity.

4.3 Finance

For an obvious case, let’s run a DF test on NVDA prices… (The data was loaded above.)

nvda <- prices[,2]
adf.test(nvda)

    Augmented Dickey-Fuller Test

data:  nvda
Dickey-Fuller = 4.0017, Lag order = 18, p-value = 0.99
alternative hypothesis: stationary

That was to be expected. We cannot reject the null of non-stationarity.

kpss.test(nvda)

    KPSS Test for Level Stationarity

data:  nvda
KPSS Level = 18.826, Truncation lag parameter = 11, p-value = 0.01

No big surprise: the tests agree. Though in this case, we reject the null of stationarity.

Next, what about simple differences?

adf.test(na.omit(nvda - lag(nvda)))

    Augmented Dickey-Fuller Test

data:  na.omit(nvda - lag(nvda))
Dickey-Fuller = -18.793, Lag order = 18, p-value = 0.01
alternative hypothesis: stationary

So we must reject the null of non-stationarity. The problem is that the DF test only implies that there is no unit root.
But this does not mean that the series is stationary! That’s an important subtlety.
Let’s look at the KPSS outcome.

kpss.test(na.omit(nvda - lag(nvda)))

    KPSS Test for Level Stationarity

data:  na.omit(nvda - lag(nvda))
KPSS Level = 1.3816, Truncation lag parameter = 11, p-value = 0.01

Et voilà! Here the tests “disagree” with regard to stationarity, though, again, the DF test does not really test for stationarity strictly speaking.
It is the KPSS test that is “correct” here.

Now let’s turn to returns…

adf.test(na.omit(nvda/lag(nvda) - 1))

    Augmented Dickey-Fuller Test

data:  na.omit(nvda/lag(nvda) - 1)
Dickey-Fuller = -17.683, Lag order = 18, p-value = 0.01
alternative hypothesis: stationary
kpss.test(na.omit(nvda/lag(nvda) - 1))

    KPSS Test for Level Stationarity

data:  na.omit(nvda/lag(nvda) - 1)
KPSS Level = 0.16278, Truncation lag parameter = 11, p-value = 0.1

Here the result for KPSS is not super clear cut, but it does seem that the two tests point towards stationarity.
FINALLY !!!!

To wrap up:

  test        prices           simple difference   returns
  DF test     not stationary   stationary          stationary
  KPSS test   not stationary   not stationary      stationary

To close this session, let’s recall a typical representation of returns (quite abstract).
Below, the returns of Apple.

prices |> 
  data.frame() |>
  rownames_to_column(var = "date") |>
  ggplot(aes(x = as.Date(date), y = AAPL/lag(AAPL)-1)) + geom_line() +
  theme_minimal() + 
  theme(axis.title = element_blank(),
        text = element_text(size = 16)) 

And below, a plot of simulations of white noise. Can you spot a difference?

data.frame(t = 1:(100*N), x = rnorm(100*N)/10) |>
  ggplot(aes(x = t, y = x)) + geom_line()

4.4 Exercise: reproduce these tests on crypto data!